Case-insensitive OH table reads via targeted Spark analyzer rule#559

Open
pandaamit91 wants to merge 9 commits into linkedin:main from pandaamit91:ampanda/oh-case-insensitive-reads

Conversation


@pandaamit91 pandaamit91 commented Apr 24, 2026

Summary

  • OpenHouse tables preserve column casing as stored in the catalog (e.g. ID, not id). When a Spark job runs with spark.sql.caseSensitive=true, queries that reference columns in a different casing than what the table stores fail with an unresolved attribute error.
    • A simple session-level fix (caseSensitive=false in OpenHouseCatalog.initialize()) was considered but rejected: it is session-wide and breaks pipelines that have case-duplicate column names in joins or intermediate DataFrames.
    • This PR introduces OHCaseInsensitiveResolveRule, a custom Spark analyzer rule injected via OpenhouseSparkSessionExtensions.injectResolutionRule. The rule renames UnresolvedAttribute nodes to match the stored column casing only for OH table relations, so Spark's own ResolveReferences rule finds an exact match on the next fixed-point iteration. The rule does not modify spark.sql.caseSensitive and has no effect on non-OH tables in the same session.
    • This is the read-path complement to the write-path normalization in doUpdateSchemaIfNeeded (separate PR). Together they make all OH table reads and writes case-insensitive end-to-end without mutating table schemas or session config.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Problem

A Spark job that sets spark.sql.caseSensitive=true causes reads from OpenHouse tables to fail when a query references a column in different casing than what the table stores (e.g. querying id on a table whose schema holds ID). This also breaks
Spark views whose SQL was written against the stored casing.

A previous approach — overriding spark.sql.caseSensitive=false in OpenHouseCatalog.initialize() — fixed the read failure but introduced a session-wide side effect: DataFrames or joins that intentionally had case-duplicate column names would
become ambiguous and throw AnalysisException, and pipelines that legitimately depend on caseSensitive=true would break.

Solution

Replace the session-level override with OHCaseInsensitiveResolveRule, a custom Spark analyzer rule registered via OpenhouseSparkSessionExtensions.injectResolutionRule.

How it works

Spark's analyzer runs resolution rules in a fixed-point loop. Our rule is injected into the same loop and runs alongside ResolveReferences:

Query: SELECT id FROM openhouse.db.t (table stores "ID")

Iteration 1:
ResolveRelations: UnresolvedRelation → DataSourceV2Relation(output=["ID"])
ResolveReferences: UnresolvedAttribute("id") — no exact match → stays unresolved
OHCaseInsensitiveResolveRule: sees "ID" in OH schema, renames UnresolvedAttribute("id") → ("ID")

Iteration 2:
ResolveReferences: UnresolvedAttribute("ID") — exact match → AttributeReference("ID") ✓

The rule:

  1. Scans the plan for DataSourceV2Relation nodes whose backing catalog has an OpenHouse catalog-impl (checked via the spark.sql.catalog.<catalog-name>.catalog-impl Spark conf — no hardcoded catalog names).
  2. Builds a lowercase → stored_name map from the relation's output columns. Tables where two or more columns share the same case-folded name are excluded — consistent with the server-side write-path guard.
  3. Renames any UnresolvedAttribute whose last name-part case-insensitively matches an OH column to use the stored casing.

The rule does NOT modify spark.sql.caseSensitive. Non-OH tables and intermediate DataFrame operations in the same session are completely unaffected.
Since spark-3.5 bundles spark-3.1's runtime as a dependency, both Spark versions pick up this change automatically.
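
The mapping-and-rename core described above can be sketched in plain Scala. This is a hypothetical illustration, not the actual OHCaseInsensitiveResolveRule source: column references are modeled as plain name-part strings rather than Spark UnresolvedAttribute nodes, and the object/method names are invented for the sketch.

```scala
// Hypothetical sketch of the rule's mapping logic (not the PR's actual code).
object CaseMappingSketch {
  // Build a lowercase -> stored-name map from a relation's output columns.
  // Returns an empty map when two stored columns collide on their
  // case-folded name (the case-duplicate guard described above).
  def buildMapping(storedColumns: Seq[String]): Map[String, String] = {
    val grouped = storedColumns.groupBy(_.toLowerCase)
    if (grouped.values.exists(_.size > 1)) Map.empty
    else grouped.map { case (lower, names) => lower -> names.head }
  }

  // Rename a reference's last name-part to the stored casing, if it
  // case-insensitively matches an OH column; otherwise leave it unchanged.
  def normalize(nameParts: Seq[String], mapping: Map[String, String]): Seq[String] =
    mapping.get(nameParts.last.toLowerCase) match {
      case Some(stored) => nameParts.init :+ stored
      case None         => nameParts
    }
}
```

With a mapping of `Map("id" -> "ID")`, a reference like `t.id` becomes `t.ID`, which Spark's ResolveReferences then matches exactly on the next fixed-point iteration.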

Relationship to write-path fix

This PR is the read-path complement to the write-path normalization in doUpdateSchemaIfNeeded (separate PR). Together they ensure:

  • Writes (server-side): write schema normalized to table casing before validation/storage — covers all write clients
  • Reads (client-side): Spark column references resolve against stored casing without touching session config

Testing Done

  • Manually Tested on local docker setup. Please include commands run, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Three tests added to CatalogOperationTest (run via catalogTest Gradle task, backed by a live embedded OH server):

  • testReadWithCaseMismatchSucceeds_andDoesNotChangeCaseSensitiveConfig: Creates a table with uppercase ID column, sets caseSensitive=true, queries with lowercase id. Asserts the row is returned correctly and that spark.sql.caseSensitive is still "true" after the query — confirming the rule does not mutate session config.
  • testViewWithCaseMismatchResolvesViaRule: Same table setup, but accessed through a Spark temp view whose SQL references id (lowercase). Asserts the view reads succeed with caseSensitive=true, confirming that view expansion is also covered.
  • testCaseDuplicateTableIsExcludedFromNormalization: Creates a table with both id (field 1) and ID (field 2). Asserts that an ambiguous reference throws rather than silently resolving to the wrong column — confirming the case-duplicate guard works correctly.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

pandaamit91 and others added 2 commits April 23, 2026 23:41
…tion

A Spark job that sets spark.sql.caseSensitive=true causes reads from
OpenHouse tables to fail when the query references a column in different
casing than what the table stores (e.g. querying "id" on a table whose
schema holds "ID"). This also breaks Spark views whose SQL was written
against the stored casing.

Fix: override initialize() in the Spark-layer OpenHouseCatalog to set
spark.sql.caseSensitive=false in the active SparkSession immediately
after the catalog is initialized. This fires once per Spark application
(when the OH catalog is first accessed), guaranteeing case-insensitive
column resolution for all subsequent OH table reads, view expansions,
and joins — regardless of what the user has configured.

Testing (CatalogOperationTest):
- testCatalogInitializationForcesCaseInsensitiveReads: sets
  caseSensitive=true, initializes a fresh OH catalog instance, asserts
  the setting is overridden to false.
- testReadColumnRefCaseInsensitiveAfterCatalogInit: creates a table
  with uppercase column "ID" via the Iceberg catalog API, then queries
  with lowercase "id" after catalog re-initialization and asserts the
  row is returned without error.
- testViewWithLowercaseRefResolvesAfterCatalogInit: same setup with a
  Spark temp view referencing the column in lowercase, asserts view
  reads resolve correctly after catalog initialization.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the earlier session-level spark.sql.caseSensitive=false override
in OpenHouseCatalog.initialize() with a targeted Spark analyzer rule,
OHCaseInsensitiveResolveRule, registered via OpenhouseSparkSessionExtensions.

Problem with the session-level approach:
Setting caseSensitive=false globally for the entire Spark session caused
side effects for non-OH tables in the same job: DataFrames or joins
that had case-duplicate column names (e.g. from joining tables both
having "id") would become ambiguous and throw AnalysisException.

New approach — OHCaseInsensitiveResolveRule:
The rule is injected into Spark's analyzer fixed-point loop via
injectResolutionRule. On each analysis pass it:
  1. Scans the plan for DataSourceV2Relation nodes whose backing catalog
     is configured with an OpenHouse catalog-impl (checked via Spark
     conf — no catalog name hardcoding).
  2. Builds a lowercase->stored-name map from the relation's output
     columns. Tables where two columns share the same case-folded name
     are excluded (ambiguous target — consistent with the server-side
     write-path guard).
  3. Renames any UnresolvedAttribute whose last name-part
     case-insensitively matches an OH column to use the stored casing.
     Spark's own ResolveReferences rule then finds an exact match on
     the next fixed-point iteration.

The rule does NOT modify spark.sql.caseSensitive. Non-OH tables, joins,
and intermediate DataFrame operations in the same session are unaffected.

Testing (CatalogOperationTest — catalogTest task):
- testReadWithCaseMismatchSucceeds_andDoesNotChangeCaseSensitiveConfig:
  creates a table with uppercase "ID", sets caseSensitive=true, queries
  with lowercase "id", asserts the row is returned AND that caseSensitive
  remains "true" (the session config is not mutated).
- testViewWithCaseMismatchResolvesViaRule: same table, a temp view
  referencing "id", asserts view reads succeed with caseSensitive=true.
- testCaseDuplicateTableIsExcludedFromNormalization: table with both
  "id" and "ID" columns, asserts that an ambiguous reference throws
  rather than silently resolving to the wrong column.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pandaamit91 pandaamit91 changed the title Force caseSensitive=false in OpenHouse Spark catalog initialization to make OH reads always case-insensitive Case-insensitive OH table reads via targeted Spark analyzer rule Apr 24, 2026
pandaamit91 and others added 2 commits April 24, 2026 21:31
…ert server rejection

The previous test tried to create a table with case-duplicate columns ("id"
and "ID") via the real OH server then assert the rule skips normalization.
This always failed with BadRequestException because the server-side schema
validation (write-path guard) rejects such schemas at the REST API level.

Rewrite the test to assert that catalog.createTable() throws for a
case-duplicate schema. This verifies the server-side guard that ensures such
tables can never be created in the first place, and documents why
OHCaseInsensitiveResolveRule carries a matching defensive exclusion for
pre-existing case-duplicate tables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ase-duplicate tables

The existing integration test (CatalogOperationTest) can only assert that the OH
server rejects case-duplicate table creation. It cannot exercise the defensive guard
inside OHCaseInsensitiveResolveRule that skips normalization when a table already has
case-duplicate columns (e.g. tables predating server-side validation).

Add OHCaseInsensitiveResolveRuleTest using the mock OH server. The test creates a
case-duplicate Iceberg table directly via the Java API (bypassing both Spark SQL and
OH server validation), then mocks the OH catalog to serve it. The key assertion:

  With caseSensitive=true, a mixed-case reference "Id" (matching neither "id" nor
  "ID" exactly) must throw rather than silently resolving to the wrong column.

Without the guard, the rule's map would contain "id" -> "ID" (last write wins), so
"Id" would be renamed to "ID" and resolve silently. The guard returns an empty map
for case-duplicate tables, leaving Spark's ResolveReferences to report an unresolved
attribute as expected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@cbb330 cbb330 left a comment


can you check if this analyzer also helps solve the Write side tests that need client changes which i've called out here:

#562 (comment)

val mappings = collectOHColumnMappings(plan)
if (mappings.isEmpty) return plan

plan.transformExpressions {
Collaborator


this scope is global, correct? We would need to prevent this rename for non-OpenHouse catalog (Hive) tables as well

Collaborator


Yes, the extensions are applied to the Hive catalog as well, per our Spark cluster config

Contributor Author


Thanks for the catch @cbb330 and @dxichen

The bug was that collectOHColumnMappings correctly identified OH columns, but plan.transformExpressions was global and it renamed every UnresolvedAttribute in the plan, including those belonging to Hive or other non-OH catalog tables in the same query. Under caseSensitive=true, this would have broken resolution for non-OH tables sharing a case-folded column name with an OH column.

Fix: collectOHColumnMappings now also tracks column names from every non-OH resolved LeafNode (non-OH DataSourceV2Relation, HiveTableRelation, file scans, etc.). Names that appear in any non-OH relation are excluded from the OH mapping. The rename is only applied to names that are unambiguously OH-specific.
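
The exclusion described above can be sketched as a plain-Scala set subtraction. This is a hypothetical illustration (invented names; plain strings stand in for Spark attribute metadata), not the PR's actual collectOHColumnMappings implementation:

```scala
// Hypothetical sketch of the cross-catalog exclusion fix (not the real code).
object CrossCatalogExclusionSketch {
  // ohColumns: stored column names collected from OH relations in the plan.
  // nonOhColumns: column names seen in any non-OH resolved leaf relation
  // (Hive tables, other v2 catalogs, file scans) in the same plan.
  // Names present in any non-OH relation are dropped from the OH mapping,
  // so the rename only touches names that are unambiguously OH-specific.
  def safeMapping(ohColumns: Seq[String], nonOhColumns: Seq[String]): Map[String, String] = {
    val nonOhFolded = nonOhColumns.map(_.toLowerCase).toSet
    ohColumns
      .filterNot(c => nonOhFolded.contains(c.toLowerCase))
      .map(c => c.toLowerCase -> c)
      .toMap
  }
}
```

In a cross-catalog join where the OH table stores ID and a Hive table stores id, the shared case-folded name is excluded and neither side's reference is rewritten.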

… mapping

The rule applied transformExpressions to the whole plan tree after building
its column mapping from OH relations. In a cross-catalog query (OH table +
Hive/other v2 catalog table), if both tables share a case-folded column name,
the global rename would corrupt the non-OH table's column reference and break
resolution under caseSensitive=true.

Fix: collectOHColumnMappings now also tracks column names from all non-OH
resolved LeafNodes (DataSourceV2Relation for other v2 catalogs, HiveTableRelation,
etc.). Names that appear in any non-OH relation are excluded from the OH mapping
before transformExpressions runs — the rename is only applied to names that are
unambiguously OH-specific.

Also fix testCaseDuplicateTableIsExcludedFromNormalization in CatalogOperationTest,
which was incorrectly assertThrows on catalog.createTable: the open-source server
has no CREATE-time case-duplicate guard (that lives in li-openhouse's LiSchemaValidator
for schema evolution). Reverted to the original intent: CREATE succeeds, but the
ambiguous SELECT reference throws — which is what the rule's empty-mapping guard
ensures.

New test: testCrossCatalogJoin_nonOHTableColumnNotRenamedToMatchOHCasing verifies
the fix end-to-end using a testhelper (Hadoop v2 catalog) table alongside an OH
table in the same JOIN query.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pandaamit91
Contributor Author

can you check if this analyzer also helps solve the Write side tests that need client changes which i've called out here:

#562 (comment)


So I have written a different write resolver because of these reasons:

  1. The reads rule matches the wrong node type
    OHCaseInsensitiveResolveRule matches UnresolvedAttribute nodes — column references that haven't been resolved yet (e.g. id typed in a SQL string before Spark knows which table it belongs to). By the time a write is being analyzed, the source DataFrame's columns are already fully resolved AttributeReference nodes. There are no UnresolvedAttribute nodes left to rewrite, so the reads rule simply never fires on write plans.

  2. The reads rule can't run before ResolveOutputRelation anyway
    Both rules are injected via injectResolutionRule, which places them at the end of Spark's Resolution fixed-point batch. ResolveOutputRelation is a built-in rule that runs inside that same batch — before injected rules get a chance to fire. So even if the reads rule could match write nodes, ResolveOutputRelation would have already thrown "Cannot find data for output column 'ID'" before the rule ever ran.

The write rule sidesteps this entirely with a different mechanism: ACCEPT_ANY_SCHEMA on the table makes outputResolved=true, which causes ResolveOutputRelation to skip OH tables completely. Then OHWriteSchemaNormalizationRule is registered as a post-hoc rule — a separate batch that runs after the main Resolution batch succeeds — so it can do the column normalization without racing against ResolveOutputRelation.

pandaamit91 and others added 4 commits April 30, 2026 16:48
…n, and column-order writes

Three classes of failures were present in run 25183190844:

1. IncompatibleClassChangeError (Iceberg 1.5 / Spark 3.5 API changes)
   - SparkCatalog.loadTable(Identifier) return type changed from SparkTable to Table in Iceberg 1.5;
     added OHSparkCatalog.java to spark-3.5 module compiled against Iceberg 1.5 (Table return type).
   - LeafNode changed from class to interface in Spark 3.5; added OHCaseInsensitiveResolveRule.scala
     to spark-3.5 module so it compiles correctly against Spark 3.5 LeafNode interface.
   - Added OpenhouseSparkSessionExtensions.scala to spark-3.5 to override the bundled 3.1 version.

2. Branch write corruption (OHSparkCatalog dropped branch field)
   - OHSparkCatalog.withAcceptAnySchema used new SparkTable(table, (Long)null, false) which silently
     dropped the branch field from branch-qualified SparkTable instances.  All branch writes then
     landed on the main table.  Fixed by choosing the SparkTable(Table, String, boolean) constructor
     when original.branch() != null, preserving the branch reference.

3. Column-order mismatch for by-name writes (projectByName kept source order)
   - ResolveOutputRelation (skipped via ACCEPT_ANY_SCHEMA) reorders columns to TARGET schema order.
     OHWriteSchemaNormalizationRule.projectByName iterated source columns, keeping source order, so
     Iceberg received columns out of order and rejected the write with "X is out of order".
   - Fixed projectByName to iterate TARGET columns and produce expressions in target order, matching
     ResolveOutputRelation's behaviour.  Also added case-duplicate-source guard.

Additional test fixes:
   - Updated CTASNonNull tests to expect OHSparkCatalog (not SparkCatalog) as the catalog class.
   - Removed INSERT from testCaseDuplicateTableIsExcludedFromNormalization: Iceberg 1.5's ReassignIds
     uses a case-insensitive map that throws on case-duplicate schemas; the SELECT ambiguity assertion
     fires at analysis time independent of whether the table has data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
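
The target-order projection fix described in this commit can be sketched in plain Scala. This is a hypothetical illustration (the real OHWriteSchemaNormalizationRule operates on Spark NamedExpressions, not strings; names here are invented):

```scala
// Hypothetical sketch of the projectByName fix: iterate TARGET columns so the
// projected output matches the table's column order, as ResolveOutputRelation
// would have done. Returns None when the write cannot be normalized.
object TargetOrderProjectionSketch {
  def projectByTargetOrder(sourceCols: Seq[String], targetCols: Seq[String]): Option[Seq[String]] = {
    val bySourceLower = sourceCols.groupBy(_.toLowerCase)
    // Case-duplicate-source guard: ambiguous source columns abort normalization.
    if (bySourceLower.values.exists(_.size > 1)) None
    else {
      // Walk the target schema, picking the case-insensitive source match for
      // each target column — output lands in target order, not source order.
      val picks = targetCols.map(t => bySourceLower.get(t.toLowerCase).map(_.head))
      if (picks.contains(None)) None else Some(picks.flatten)
    }
  }
}
```

Iterating source columns instead (the original bug) would have preserved source order and triggered Iceberg's "X is out of order" rejection.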
…r all plan nodes

plan.transformExpressions only applies mapExpressions to the root plan
node's own expression fields (via mapProductIterator), leaving child
nodes untouched.  For a query like SELECT id FROM v ORDER BY id, the
plan is Sort → Project → SubqueryAlias: transformExpressions renamed id→ID
in Sort but left Project's id intact, causing an AnalysisException on
the next fixed-point pass.

Switch to plan.resolveOperatorsDown { case p => p.transformExpressions {...} }
which visits every unanalyzed plan node top-down (skipping already-resolved
view bodies) and applies the attribute rename to each one.

Add test assertions covering SELECT id and SELECT * from both an
explicit-column TEMP view and a SELECT * TEMP view over an OH table with
uppercase column names.  All four assertions now pass on spark-3.1 and
spark-3.5.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
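
The root-only vs. whole-tree distinction this commit fixes can be illustrated with a toy tree. This is an analogy only — not Spark's actual QueryPlan API — showing why a rewrite applied at the root node alone misses child nodes:

```scala
// Toy analogy for the traversal bug (invented types, not Spark code).
case class PlanNode(exprs: Seq[String], children: Seq[PlanNode]) {
  // Root-only rewrite: analogous to calling plan.transformExpressions once,
  // which rewrites only the top node's own expression fields.
  def transformRootExprs(f: String => String): PlanNode =
    copy(exprs = exprs.map(f))

  // Whole-tree rewrite: analogous to resolveOperatorsDown applying the
  // rename at every plan node, top-down.
  def transformAllExprs(f: String => String): PlanNode =
    PlanNode(exprs.map(f), children.map(_.transformAllExprs(f)))
}
```

For a Sort → Project plan referencing id twice, the root-only variant renames id→ID in Sort but leaves Project's id intact — exactly the mismatch that caused the AnalysisException on the next fixed-point pass.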
…test

Operations.java (SpotBugs):
- Replace keySet() + get() with entrySet() iteration in writeBackupDataManifests
  (WMI_WRONG_MAP_ITERATOR)
- Use StandardCharsets.UTF_8 in getBytes() call
  (DM_DEFAULT_ENCODING)

spotbugsExclude.xml:
- Suppress RCN_REDUNDANT_NULLCHECK_WOULD_HAVE_BEEN_A_NPE for
  Operations.prepareBackupDataManifests (SpotBugs 4.x false positive for
  try-with-resources auto-close null check)
- Suppress MS_MUTABLE_COLLECTION_PKGPROTECT for
  HouseTablesH2Repository.softDeletedTables (test-infrastructure interface field,
  not reachable by untrusted callers)
- Suppress LI_LAZY_INIT_STATIC for OpenHouseSparkITest.getBuilder (test
  infrastructure with sequential single-threaded JUnit 5 execution)

CatalogOperationTest.testCaseDuplicateTableIsExcludedFromNormalization:
- Handle both enforcement behaviors: some server deployments reject case-duplicate
  schemas at CREATE TABLE time (BadRequestException); others allow creation and
  rely on Spark's AnalysisException for the ambiguous column reference.  Use a
  try-catch so the test passes in both environments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ention)

The testNestedStructField_normalizedCaseInsensitively test was failing because
the stored schema used lowercase "payload" as the top-level column. With
caseSensitive=true, ResolveReferences (earlier in the Resolution batch) found
the struct attribute by exact case match and immediately threw AnalysisException
when the nested field "event_id" didn't match "EVENT_ID" — before
OHCaseInsensitiveResolveRule could run.

Fix: change the test schema to PAYLOAD/EVENT_ID/NESTED/VALUE (all uppercase),
matching the Hive-migration production scenario where every identifier is
uppercased. With a top-level case mismatch, ResolveReferences leaves the full
dotted reference unresolved (no throw), allowing our rule to normalize the
complete path on the same fixed-point iteration.

Also documents the batch-ordering constraint in both spark-3.1 and spark-3.5
OHCaseInsensitiveResolveRule Scaladoc.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
